Phrase-based data classification system

ABSTRACT

A method of classifying data is disclosed. Text data items are received. A set of classes into which the text data items are to be classified is received. A phrase-based classifier to classify the text data items into the set of classes is selected. The phrase-based classifier is applied to classify the text data items into the classes. Here, the applying includes creating a controlled vocabulary pertaining to classifying the text data items into the set of classes, building phrases based on the text data items and the controlled vocabulary, and classifying the text data items into the set of classes based on the phrases.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 61/553,140, filed Oct. 28, 2011, entitled “PHRASE-BASED DOCUMENT CLASSIFICATION SYSTEM,” which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

This application relates generally to the technical field of data classification and, in one specific example, to classification of self-descriptive job titles.

BACKGROUND

Various algorithms (or classifiers) are used to classify data, including linear classifiers, support vector machines (SVMs), kernel estimation, neural networks, Bayesian networks, and other classifiers. Metrics for evaluating classifiers include precision, recall (or coverage), receiver operating characteristic (ROC) curves, explainability of classification results, development costs, and maintenance costs. The performance of a classifier may depend on the characteristics of the data being classified. Furthermore, it may be difficult to identify a relationship between the data to be classified and the performance of the classifier.

On a social networking site (e.g., Facebook or LinkedIn), users may manage their own profiles and may use different forms of natural language (e.g., words and phrases) to describe things that fit into the same category. For example, users of the social networking site may be free to specify their job titles in any form they wish. In a data set of 100 million such job titles, it may not be unusual for users to specify a job title corresponding to a job category (e.g., “Software Engineer”) in 40,000 different ways.

For various reasons (e.g., in order to boost revenues earned by the social networking site through targeted advertising), owners (or administrators) of the social networking site may wish to classify their users' self-descriptive job titles (or other user-specified data) into a generated or determined set of categories (e.g., a few dozen categories). For job title classification, examples of such categories may be Engineering, Marketing, Sales, Support, Healthcare, Legal, Education, and so on.

It may be possible to use standard classifiers (e.g., SVMs) to solve such a data classification problem. However, standard classifiers may not solve the problem as effectively as another classifier, such as a classifier that is specifically targeted at the problem, especially in view of a particular set of performance metrics selected by the owners or administrators of the social networking site.

BRIEF DESCRIPTION OF THE DRAWINGS

Some embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which:

FIG. 1 is a network diagram depicting a client-server system, within which various example embodiments may be deployed;

FIG. 2 is a set of graphs plotting |T̂*| as a function of |V*|, |T**| as a function of |V**|, and |T**|/|V**| as a function of |V**| for a test set of data;

FIG. 3 is a set of graphs illustrating a plotting of the number of ngrams added in comparison to the length of the ngrams over multiple processing steps;

FIG. 4 is a block diagram illustrating an example embodiment of a system for phrase-based classification;

FIG. 5 is a flow chart illustrating an example method of offline phrase classification;

FIG. 6 is a table illustrating a sample of common ways in which the job title corresponding to the category “Software Engineer” is specified by users in their profiles on a social networking site;

FIG. 7 is a table illustrating an example of the results of applying an SVM classifier to a job title classification task; and

FIG. 8 is a block diagram of a machine in the example form of a computer system within which instructions for causing the machine to perform any one or more of the methodologies discussed herein may be executed.

DETAILED DESCRIPTION

In the following description, for purposes of explanation, numerous specific details are set forth in order to provide an understanding of various embodiments of the inventive subject matter. It will be evident, however, to those skilled in the art that embodiments may be practiced without these specific details. Further, to avoid obscuring the inventive concepts in unnecessary detail, well-known instruction instances, protocols, structures, and techniques have not been shown in detail. As used herein, the term “or” may be construed in an inclusive or exclusive sense, the term “user” may be construed to include a person or a machine, and the term “interface” may be construed to include an application program interface (API) or a user interface.

We present a data (or document) classification system that employs lazy learning from labeled phrases. The system can be highly effective whenever the following property holds: most of the information about document labels is captured in phrases. We call this property near-sufficiency. Our research contribution is twofold: (a) we quantify the near-sufficiency property using the Information Bottleneck principle and show that it is easy to check on a given dataset; (b) we reveal that in all practical cases, from small-scale to very large-scale, manual labeling of phrases is feasible: the natural language constrains the number of common phrases composed of a vocabulary to grow linearly with the size of the vocabulary. Both contributions provide a firm foundation for the applicability of the phrase-based classification (PBC) framework to a variety of large-scale tasks. An example implementation of the PBC system may be focused on the task of job title classification. Such an implementation may be incorporated, for example, into a data standardization effort of a social networking site. The system may significantly outperform its predecessor in terms of both precision and coverage. The system may be used in conjunction with an ad targeting product or other applications. The PBC system may excel in high explainability of the classification results, as well as in low development and low maintenance costs. In various embodiments, benchmarking of the PBC against other high-precision document classification systems (e.g., using different algorithms) may show that PBC is most useful in multilabel classification.

In various embodiments, a method of classifying data is disclosed. Text data items are received. A set of classes into which the text data items are to be classified is received. A phrase-based classifier to classify the text data items into the set of classes is selected. The phrase-based classifier is applied to classify the text data items into the classes. Here, the applying includes creating a controlled vocabulary pertaining to classifying the text data items into the set of classes, building phrases based on the text data items and the controlled vocabulary, and classifying the text data items into the set of classes based on the phrases.

FIG. 1 is a network diagram depicting a client-server system 100, within which various example embodiments may be deployed. A networked system 102, in the example forms of a network-based social-networking site or other communication system, provides server-side functionality, via a network 104 (e.g., the Internet or a Wide Area Network (WAN)), to one or more clients. FIG. 1 illustrates, for example, a web client 106 (e.g., a browser, such as the Internet Explorer browser developed by Microsoft Corporation of Redmond, Wash.) and a programmatic client 108 executing on respective client machines 110 and 112. Each of the one or more clients 106, 108 may include a software application module (e.g., a plug-in, add-in, or macro) that adds a specific service or feature to a larger system.

An API server 114 and a web server 116 are coupled to, and provide programmatic and web interfaces respectively to, one or more application servers 118. The application servers 118 host one or more social-networking applications 120. The application servers 118 are, in turn, shown to be coupled to one or more database servers 124 that facilitate access to one or more databases or NoSQL or non-relational data stores 126.

The social-networking applications 120 may provide a number of social-networking functions and services to users that access the networked system 102. While the social-networking applications 120 are shown in FIG. 1 to form part of the networked system 102, in alternative embodiments, the social-networking applications 120 may form part of a service that is separate and distinct from the networked system 102.

Further, while the system 100 shown in FIG. 1 employs a client-server architecture, various embodiments are, of course, not limited to such an architecture, and could equally well find application in a distributed, or peer-to-peer, architecture system, for example. The various social-networking applications 120 could also be implemented as standalone software programs, which do not necessarily have computer networking capabilities. Additionally, although FIG. 1 depicts machines 130, 110, and 112 as being coupled to a single networked system 102, it will be readily apparent to one skilled in the art that machines 130, 110, and 112, as well as applications 128, 106, and 108, may be coupled to multiple networked systems. For example, the applications 128, 106, and 108 may be coupled to multiple social-networking applications 120, such as payment applications associated with multiple payment processors (e.g., Visa, MasterCard, and American Express).

The web client 106 accesses the various social-networking applications 120 via the web interface supported by the web server 116. Similarly, the programmatic client 108 accesses the various services and functions provided by the social-networking applications 120 via the programmatic interface provided by the API server 114. The programmatic client 108 may, for example, perform batch-mode communications between the programmatic client 108 and the networked system 102.

FIG. 1 also illustrates a third party application 128, executing on a third party server machine 130, as having programmatic access to the networked system 102 via the programmatic interface provided by the API server 114. For example, the third party application 128 may, utilizing information retrieved from the networked system 102, support one or more features or functions on a website hosted by the third party. The third party website may, for example, provide one or more promotional, social-networking, or payment functions that are supported by the relevant applications of the networked system 102.

Automatic classification of text documents plays a preeminent role in numerous industrial and academic data mining applications. Text classification has been explored for decades, and machine learning techniques have proved successful for a variety of text classification scenarios. In various embodiments, there are four possible text classification setups, as described in the following paragraphs.

Eager learning from labeled documents—the most common setup, in which a generic machine learning classifier is trained on a set of labeled documents.

Lazy learning from labeled documents—a case-based classifier (e.g., k-nearest neighbors) is built based on labeled documents. In various embodiments, this lazy learning setup is fairly rare in the text domain, as it is highly inefficient in many practical cases.

Eager learning from labeled features—usually coupled with learning from labeled documents.

Lazy learning from labeled features—an ensemble of decision-stump-like classifiers is constructed, each of which is triggered if a document contains a certain feature, such as a word or a phrase (i.e., an ngram of words). This setup is the focus of this disclosure; we term it Phrase-Based Classification (PBC). PBC is very natural for multilabel text classification (when each document may belong to a number of classes), as a number of decision stumps may be triggered for a document. Another advantage of PBC is the explainability of classification results, which is an important feature of commercial text classification systems: the fact that a document d is categorized into a class c can be easily explained by the fact that d contains a phrase t that was manually categorized into c.

PBC may not be the ultimate solution for every text classification task. In this disclosure, we characterize the class of tasks in which PBC may be set up for success. We introduce the property of near-sufficiency: the information about a document's class label is mostly concentrated in a few ngrams (usually noun phrases) occurring in the document. Intuitively, the near-sufficiency property holds if the categorized documents are short (such as news headlines, helpdesk requests, Twitter tweets, Quora questions, LinkedIn user profile summaries, etc.), just because short documents have little content beyond a few phrases. Still, longer texts can hold this property as well. An example may be to categorize pieces of code according to the programming language in which the code was written. Regardless of the code's length, just a few keywords may be enough to reveal the class label accurately.

The field of text classification in general, and PBC in particular, may be affected by at least two factors, as described in the following paragraphs.

A quantum leap in the data scale. The social media boom brings huge amounts of textual content on a daily basis. Blogs, reviews, news feeds, and status updates (such as tweets) are often publicly available and ready for mining. Analyzing hundreds of millions or even billions of text items is the modern reality. However, the vast majority of them may be unlabeled, and thus intensive manual labeling may often be necessary.

Availability of cheap editorial resources (e.g., crowdsourcing). Companies such as Amazon (mturk.com), Samasource, CrowdFlower, and others provide means for quickly labeling large-scale data. Crowdsourcing can also be used for evaluating classification results. The cost of labeling a data instance can be as low as a fraction of a cent. In various embodiments, the quality of crowdsourcing results may be quite low. Some workers may be underqualified, and some simply may not stay focused for long, so their output may often be noisy and inconsistent. This may cause some frustration among researchers who apply crowdsourcing to text classification. Nevertheless, manual labeling of hundreds of thousands of data instances (which was impractical just a few years ago) may now be perfectly practical.

In various embodiments, we introduce Phrase-Based Multilabel Classification as a process consisting of the following steps: (a) given a dataset D and a set of classes C, construct a Controlled Vocabulary V of words relevant for categorizing D into C; (b) from the dataset D, extract a set T of frequently used phrases that are composed out of words from V; (c) categorize phrases T into C (e.g., via crowdsourcing); (d) map each document d onto a subset of phrases from T and assign d into a subset of classes based on the classification of the phrases. Below, we describe phrase-based classification in greater detail.
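For concreteness, the four steps above can be condensed into a short sketch. The following Python code is illustrative only: the function names, the toy controlled vocabulary, and the phrase labels are our own assumptions, standing in for the content manager's vocabulary and the crowdsourced labels of the real pipeline.

```python
from itertools import combinations

def extract_phrases(documents, vocabulary, min_count=2):
    """Step (b): extract frequently used phrases (unordered word sets)
    composed only of controlled-vocabulary words."""
    counts = {}
    for doc in documents:
        words = sorted(set(doc.lower().split()) & vocabulary)
        # Every non-empty subset of in-vocabulary words is a candidate phrase;
        # a real system would cap the ngram length.
        for n in range(1, len(words) + 1):
            for phrase in combinations(words, n):
                counts[phrase] = counts.get(phrase, 0) + 1
    return {p for p, c in counts.items() if c >= min_count}

def classify_document(doc, vocabulary, labeled_phrases):
    """Step (d): map a document onto labeled phrases and take the
    union of their classes."""
    words = set(doc.lower().split()) & vocabulary
    classes = set()
    for phrase, phrase_classes in labeled_phrases.items():
        if set(phrase) <= words:
            classes |= phrase_classes
    return classes

# Toy walk-through: V and the phrase labels would come from the content
# manager and from crowdsourcing in the real pipeline.
V = {"software", "engineer", "sales", "manager"}
docs = ["Senior Software Engineer", "Software Engineer II", "Sales Manager"]
T = extract_phrases(docs, V)
labels = {("engineer", "software"): {"Engineering"},
          ("manager", "sales"): {"Sales"}}
print(classify_document("Lead Software Engineer", V, labels))  # {'Engineering'}
```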

Crowdsourcing classification of phrases may have at least two important advantages. First, humans can be very good at categorizing short pieces of text. If properly trained, human annotators can perform the task accurately and fairly quickly, mostly based on their common knowledge. In contrast, creating a comparable common knowledge base in an automatic classification system may be difficult. Second, humans may be much better than machines at multilabel classification. Questions such as “How many classes does a data instance belong to?” or “Should this data instance be assigned to any class?” may be relatively easy for a human annotator.

We see at least three potential drawbacks of crowdsourcing classification of phrases. First, crowdsourcing classification may not always be consistent; thus, we may need to apply a fairly heavyweight consistency check to the crowdsourcing results. Second, human annotators may not be able to cope with a large number of classes. (However, if the classification is done for a user-facing product, a large number of classes may not be necessary, because the end users may not be able to cope with them either.) Third, the number of phrases to be labeled might be overwhelmingly large and infeasible for human annotation, even in a crowdsourcing setup.

Below, we reveal that crowdsourcing classification of phrases may actually be feasible in substantially all practical situations. The natural language may impose a (very restrictive) upper bound on the effective number of phrases that can possibly be composed out of a controlled vocabulary. Based on a language model estimated over a huge text collection, we discover that the effective number of phrases grows linearly—rather than exponentially—in the vocabulary size. Thus, the language model may allow extremely high-precision classification of multimillion document collections that obey the near-sufficiency property.

As an example, phrase-based classification may be applied to the task of Job Title Classification (e.g., on a social networking site, such as LinkedIn). Here, the task may include categorizing job titles of users into a couple of dozen classes according to the job function performed by those titles' holders. Below, we describe an embodiment of the system. This embodiment may dramatically increase precision and coverage in comparison to other systems of job title classification. For example, an offline evaluation shows that we achieve about 95% classification precision. Below, we compare our phrase-based classification system with a Support Vector Machine (SVM) classification system, and show that our method may significantly outperform various SVM versions on a multilabel job title classification task.

The PBC system may be successfully deployed with respect to an ad targeting product. Increasing the coverage may result in increasing the user profile inventory available for targeting, which may have a direct revenue impact on the social networking site.

The sections below can be summarized as follows. First, we formalize the near-sufficiency property of the data that guarantees high-precision phrase-based classification (PBC). Second, we show that PBC remains feasible even for collections of millions or potentially billions of data instances. Last, we present a working use case of phrase-based classification on data associated with a social networking site, and provide insights on building a successful PBC system.

Phrase-Based Classification—Problem Setup

In various embodiments, the problem of multilabel text classification is defined as follows. Each document from an unlabeled corpus D is to be categorized into one or more classes from a class set C. More formally, a rule L: D → 2^C is to be learned from training data. Also available is a labeled test set D_test = {(d_i, C*_i)}, where |D_test| << |D| and each C*_i ⊆ C is the set of d_i's ground truth classes. The performance of the classification rule L is evaluated on the test set by comparing each L(d_i) with C*_i.
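As a minimal illustration of this evaluation protocol, the sketch below compares each L(d_i) with C*_i using example-based precision and recall; the disclosure does not fix the exact metric, so this particular choice is our assumption.

```python
def evaluate(rule, test_set):
    """Compare L(d_i) with the ground-truth class set C*_i for each
    labeled test document; report example-based precision and recall."""
    precisions, recalls = [], []
    for doc, true_classes in test_set:
        predicted = rule(doc)
        correct = len(predicted & true_classes)
        precisions.append(correct / len(predicted) if predicted else 1.0)
        recalls.append(correct / len(true_classes) if true_classes else 1.0)
    n = len(test_set)
    return sum(precisions) / n, sum(recalls) / n

# Usage: rule is any callable implementing L: D -> 2^C.
test = [("software engineer", {"Engineering"}),
        ("sales engineer", {"Sales", "Engineering"})]
rule = lambda d: {"Engineering"} if "engineer" in d else set()
print(evaluate(rule, test))  # (1.0, 0.75)
```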

In classic text categorization, the rule L is learned given a training set D_train of documents and their labels. However, the categorization framework does not restrict the training set to contain only documents. This disclosure deals with phrase-based classification (PBC)—a text classification framework in which the training data consists of labeled phrases, or ngrams, extracted from the (unlabeled) documents. In this disclosure, the terms “phrase” and “ngram” are used interchangeably.

Feasibility of Phrase Labeling

Since our classification framework is heavily dependent on a manual process, we need to make sure that the process will be manageable in the generic case: the number of ngrams to annotate must not exceed human capabilities. We are only interested in annotating those ngrams that are frequent enough in the language—otherwise we will lose the generality of our framework. (Obviously, we could artificially limit the number of ngrams to be labeled. For example, we could take a list of only the ten thousand most frequent ngrams. This would hurt coverage, though, as many documents would not be mapped onto any ngram from the list.) Although the combinatorial number of all possible ngrams made of words from a vocabulary V grows exponentially with n (the length of the ngram), it is quite obvious that the language will not allow most of them to be frequently used. Let us show that the number of frequent enough ngrams is reasonably small in all practical settings. First, let us explain what we mean by “frequent enough” ngrams.

Definition. We define the frequency of a set X to be the frequency of its least frequent item: F(X) = min_{x∈X} F(x).

We say that a set T consists of frequent enough ngrams if F(T) ≥ F(V), i.e., the set of ngrams is as frequent as the vocabulary out of which those ngrams were composed. Here we solely aim at conditioning the frequency of T on the frequency of the controlled vocabulary. We might have imposed a different condition, e.g., F(T) ≥ F(V)/α, where α > 1. This would lead to similar results but would make our reasoning more complicated by introducing an extra parameter α.
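The definition and the frequent-enough condition translate directly into code. In the sketch below, the frequency table and its counts are assumed, purely for illustration.

```python
def set_frequency(items, freq):
    """F(X) = min over x in X of F(x): the frequency of a set is the
    frequency of its least frequent item."""
    return min(freq[x] for x in items)

def frequent_enough(ngrams, vocabulary, freq):
    """T consists of frequent-enough ngrams iff F(T) >= F(V)."""
    return set_frequency(ngrams, freq) >= set_frequency(vocabulary, freq)

# Toy frequency table (assumed numbers, for illustration only).
freq = {"software": 900, "engineer": 700, "software engineer": 750}
print(frequent_enough(["software engineer"], ["software", "engineer"], freq))  # True
```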

To show that the set T of frequent enough ngrams is feasibly small, we make use of the Web 1T dataset. The Web 1T data was released by Google in 2006. It consists of frequency counts of unigrams, bigrams, trigrams, fourgrams, and fivegrams, calculated over a dataset of one trillion tokens. The data is large enough to be a good model of a natural language. Another example of a dataset that may be a good model is the Books Ngram dataset. The unigram portion of the Web 1T data consists of about 7.3M words (we lowercased letters and ignored punctuation), with their frequency counts in a range from about 2×10¹⁰ down to 2×10². Needless to say, most of these words are not in our lexicons. In fact, only the upper 50K or so contain words that are in our everyday use.

We preprocessed the data by first omitting ngrams that appeared less than 1000 times in the Web 1T data, then removing stopwords and alphabetically sorting the words in each ngram (which amounts to ignoring word order). For example, given a fivegram “of words from the dictionary”, we removed the stopwords of, from, and the, and then alphabetically sorted the remaining words into a bigram “dictionary words”. We ended up with about 2.5M (distinct) unigrams, 13M bigrams, 10M trigrams, 4M fourgrams, and 1.4M fivegrams.
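A sketch of this normalization step is shown below; the stopword list is abridged and assumed.

```python
STOPWORDS = {"of", "from", "the", "and", "a", "an", "in", "to"}  # assumed, abridged

def normalize_ngram(ngram):
    """Lowercase, drop stopwords, and sort the remaining words so that
    word order is ignored; 'of words from the dictionary' becomes the
    bigram ('dictionary', 'words')."""
    words = [w for w in ngram.lower().split() if w not in STOPWORDS]
    return tuple(sorted(words))

print(normalize_ngram("of words from the dictionary"))  # ('dictionary', 'words')
```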

To establish the upper bound on the number of frequent enough ngrams, we first ask what vocabulary of size k would allow the maximal number of ngrams to be composed out of it. Obviously, this is an NP-hard problem, as we would need to construct (N₁ choose k) vocabularies, where N₁ is the number of unigrams in our data. We apply a greedy approximation. Let T = U(V) be the set of ngrams composed out of a vocabulary V, where U is an ngram construction function, provided to us by the natural language. Let V = U⁻¹(T) be the vocabulary out of which the set of ngrams T is composed. First, we take a set T* of all the most frequent ngrams in the Web 1T data, such that |V*| = |U⁻¹(T*)| = k, i.e., the underlying vocabulary V* is of size k. Second, we take a set T̂* = U(V*) of all ngrams composed of V* that is as frequent as V* itself, i.e., F(T̂*) ≥ F(V*). This way, we assure that T* ⊆ T̂*. The fact that ngram frequencies follow a power law, and that T̂* contains all the most frequent ngrams, guarantees the high quality of our approximation.

In FIG. 2, we plot |T̂*| as a function of |V*|. As we can see on the plot, |T̂*| = |V*|^β, where the exponent β is very small: 1 ≤ β < 1.3. This leads us to the conclusion that the language will only allow a reasonably small number of frequent enough ngrams to be composed out of vocabularies of reasonably small sizes.
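The greedy approximation can be sketched as follows; the ngram frequency table and its counts are assumed toy data, and the functions U and U⁻¹ are realized implicitly by filtering the table.

```python
def greedy_upper_bound(ngram_freq, k):
    """Greedy approximation: grow V* from the most frequent ngrams until
    |V*| = k, then collect T-hat* = U(V*), the ngrams built only of V* words
    that are at least as frequent as V* itself (F(T-hat*) >= F(V*))."""
    v_star = set()
    for ngram, _ in sorted(ngram_freq.items(), key=lambda kv: -kv[1]):
        words = set(ngram.split())
        if len(v_star | words) <= k:
            v_star |= words
        if len(v_star) == k:
            break
    f_v = min(ngram_freq[w] for w in v_star)  # F(V*); unigram counts assumed present
    t_hat = {g for g, f in ngram_freq.items()
             if set(g.split()) <= v_star and f >= f_v}
    return v_star, t_hat

# Toy table mixing unigram and bigram counts (assumed numbers).
freqs = {"software": 1000, "manager": 900, "software engineer": 850,
         "engineer": 800, "engineer manager": 100}
V_star, T_hat = greedy_upper_bound(freqs, k=3)
print(sorted(V_star), len(T_hat))  # ['engineer', 'manager', 'software'] 4
```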

The procedure for creating a vocabulary V* described above is artificial, and quite unrealistic in practice. While this does not matter for establishing the upper bound, we can tighten it if we apply a more natural scenario. We start with the observation that controlled vocabularies are usually built of words that come from the same topic (or a few topics). We simulate the creation of a topic-oriented controlled vocabulary V** in the following process: we first sample a word w₁ from a multinomial distribution defined over the set V_m of the m most frequent words in the Web 1T data. The word w₁ plays the role of a seed for the topic of V**. Then we add a second word w₂ ∈ V_m that most co-occurs with w₁, then a third word w₃ ∈ V_m that most co-occurs with {w₁, w₂}, and so on, until we populate our vocabulary V** with the k most co-occurring words. (The fact that the words in V** highly co-occur with each other assures that V** is built around a particular topic, which is more realistic than our previous setup. The fact that the words co-occur as much as possible allows us to establish the upper bound on the number of ngrams composed out of V**.) We then build the ngrams T** = U(V**), such that F(T**) ≥ F(V_m)—the ngrams are as frequent as the space of words from which V** was built (note that F(V_m) ≤ F(V**)). To overcome the non-determinism of seeding the topic, we repeat the process 10 times and report the mean and standard error. (The standard error is too small to be visible on our plots.)
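The simulated growth of a topic-oriented vocabulary V** may look like the sketch below; the co-occurrence counts are assumed stand-ins for the pairwise statistics that would come from the Web 1T data.

```python
import random

def build_topic_vocabulary(candidates, cooccurrence, k, seed_word=None):
    """Grow a topic-oriented vocabulary V**: seed with one word, then
    repeatedly add the candidate that co-occurs most with the current set."""
    vocab = {seed_word or random.choice(sorted(candidates))}
    while len(vocab) < k:
        best = max(candidates - vocab,
                   key=lambda w: sum(cooccurrence(w, v) for v in vocab))
        vocab.add(best)
    return vocab

# Toy co-occurrence counts between word pairs (assumed numbers).
pairs = {frozenset(p): c for p, c in {
    ("software", "engineer"): 50, ("software", "developer"): 40,
    ("engineer", "developer"): 30, ("software", "pizza"): 1,
    ("engineer", "pizza"): 1, ("developer", "pizza"): 1}.items()}
co = lambda a, b: pairs.get(frozenset((a, b)), 0)
print(build_topic_vocabulary({"software", "engineer", "developer", "pizza"},
                             co, k=3, seed_word="software"))
# {'software', 'engineer', 'developer'} (set order may vary)
```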

We probe various sizes k of the controlled vocabulary, from 250 to 16,000 words. This arguably spans the entire spectrum of possibilities, as a set of 250 words appears to be too small for a high-quality controlled vocabulary, while a set of 16,000 is larger than the entire lexicon of an average language speaker. We also probe different values of m—the number of most frequent words from which V** is built. We tried four values of m: 50K, 100K, 150K, and 200K. We argue that the utility of using m > 200K is very low, as words in the tail of that list are completely obscure, and therefore useless from the generalization point of view.

Our results are shown in FIG. 2. We plot |T**| over |V**| and see that the curve bends on the log-log scale, which may imply that the number of ngrams is no longer super-linear in the number of underlying words. To verify this claim, we expose an additional view of the same data: we plot γ = |T**|/|V**| over k = |V**|, and show that γ < 200 for all realistic values of k and m, which leads to the following formal result:

Corollary: The natural language imposes a linear dependency between the number of words in a controlled vocabulary V and the number of frequent enough ngrams composed out of words from V.

We observe an interesting saturation effect on the right plot: for large vocabularies (k > 1000), the γ ratio decreases. We can explain this as follows: high-frequency words are well represented in vocabularies of any size. Low-frequency words, however, are mostly represented in large vocabularies. Ngrams that contain lower-frequency words are even less frequent; they fall below our threshold, which leads to fewer ngrams in T** per word in V**.

One could argue that since the Web 1T data does not contain frequencies of high-order ngrams (above fivegrams), the data is not complete and no conclusion can be drawn on the overall number of frequent enough ngrams composed out of words in V. While the data is indeed not complete, we are confident that it contains enough information to let us reason about the total size of T. FIG. 3 justifies our claim.

FIG. 3 illustrates an example 400 of the number of ngrams added after every processing step: first, after scanning Web 1T bigrams, then after scanning trigrams, then fourgrams and fivegrams. As we see from the plots, the number of ngrams to be added goes down dramatically as the ngrams' order goes up. This can be easily explained by the fact that the probability of an ngram being frequent enough decreases exponentially with the ngram's length. As the number of added ngrams is already small at the fivegram landmark, we do not anticipate a significant addition to the total size of T coming from higher-order ngrams.

The first seven of the eight plots in FIG. 3 show the ngram addition data over the same values of k as in FIG. 2. The last, eighth plot shows an example of a real-world V of size 2500—the one described below—with the frequency constraint F(U(V)) ≥ F(V). While the curve behaves very similarly to the previous seven curves, the overall number of ngrams is about 3-4 times lower, as compared to the artificial setups with k=2000 and k=4000. This implies that our upper bound is actually not very tight.

Example Architecture of a Phrase-Based Classification System

FIG. 4 illustrates an example embodiment of a system 502 for phrase-based classification, which may achieve very high precision and may outperform popular classification methods. The system is composed of two modules: an offline module 504 and an online module 506. The offline module 504 aims at constructing and labeling phrases, which are then used in the online module for categorizing documents. The online module 506 is optimized for throughput, such that most of the heavy data processing is done in the offline module.

Offline Process of Phrase Classification

FIG. 5 illustrates an example method 600 of offline phrase classification. The method may be implemented by the offline module 504 of the phrase-based classification system. Let us walk through the operations included in the example method 600 one by one.

Agree 602 upon the Taxonomy of Classes C. Each classification system that is built in response to a particular business need in a commercial organization starts with defining the taxonomy of classes. Any text corpus can be categorized into a variety of class taxonomies; however, not all of them fit the data at hand well. If our data consists of, say, people's job titles, but our classes are models of Japanese cars, then we will fail to meet the system's coverage requirement. A good taxonomy may be constructed in a negotiation between business and technology groups within the organization. Note that, in various embodiments, the success of this step, as well as most of the next steps, may not be achieved without the involvement of a Content Manager, a domain expert who is in charge of keeping the data organized and presented in the best possible way. In alternative embodiments, the success of this step may not depend on the involvement of a Content Manager; for example, this step may be performed by various additional modules of the system.

The next step in the process is to Create 604 a Controlled Vocabulary V of words that are supposed to characterize the classes in the taxonomy. Controlled vocabularies are routinely used in computational linguistics to investigate particular language patterns, as well as in library science to organize and index knowledge. In our case, the creation of a controlled vocabulary is a semi-manual feature selection process: first, the words in the dataset are sorted in descending order of their frequency, and then the content manager scans through the sorted list and uses his or her domain expertise to decide whether each word should be included. Tools can be built to improve the productivity of this work, given a particular task at hand.

Once the controlled vocabulary is created, the next step is to Build Phrases T. We are only interested in ngrams of words from V that frequently reappear in the data (rare phrases are useless from the generalization point of view). We ignore the order of words in phrases, so that, for example, text classification and classification (of) text will be considered the same phrase. (One may argue that in some cases the order of words is actually crucial, e.g., Paris Hilton and Hilton, Paris are not the same. However, those cases are rare from the statistical point of view, and therefore for all practical purposes word ordering can be ignored in text classification.) Above, we proved that the language will not let the set T be too large for manual labeling. The process of building the set T can be fully automatic: (a) extract all phrase candidates from the data D; (b) sort them by frequency; (c) cut off the least frequent ones. We have to take care of the issues discussed in the following paragraphs; a sketch combining the extraction steps with these filters follows them.

Filter out compound phrases. We can use NLP techniques (such as shallow parsing) to identify phrases that are frequent enough but are clearly composed of two (or more) stand-alone phrases. For example, a phrase “Machine Learning or Information Retrieval” can be frequent enough, but keeping machine learning and information retrieval in one phrase might distract our annotators, who may make an arbitrary decision as to whether the phrase belongs to one class, two classes, or none.

Filter out too specific phrases. Some phrases, such as, e.g., “Content-Based Collaborative Filtering”, can be frequent enough in the data but too specific for an annotator to understand their meaning. It makes more sense to remove such phrases, given that a more general phrase (i.e., collaborative filtering, in this example) is in T.

Filter out near duplicate phrases. Some phrases, such as “Recommendation System” and “Recommender System”, are practically identical—not only in their form but also in their meaning. We may use the Levenshtein distance to identify those pairs and then reason about whether one of them can be mapped onto the other. Some help from the content manager may be needed with this task.
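The extraction steps (a)-(c) and the three filters might be combined as in the sketch below; the conjunction-token test for compound phrases, the length cap for too-specific phrases, and the ratio-based string similarity (standing in for the Levenshtein distance) are deliberately simple assumed heuristics.

```python
import difflib
from collections import Counter

CONJUNCTIONS = {"or", "and", "/"}  # assumed markers of compound phrases

def build_phrase_set(documents, vocabulary, min_freq, max_len=4):
    """Steps (a)-(c) from the text plus the three filters, using simple
    stand-in heuristics."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().replace("/", " / ").split()
        if CONJUNCTIONS & set(tokens):
            continue  # compound phrase: a real system would split it first
        phrase = tuple(sorted(w for w in tokens if w in vocabulary))
        if 0 < len(phrase) <= max_len:  # drop empty and too-specific phrases
            counts[phrase] += 1
    kept = []
    for phrase, freq in counts.most_common():  # (b) sort by frequency
        if freq < min_freq:
            break  # (c) cut off the least frequent candidates
        key = " ".join(phrase)
        if any(difflib.SequenceMatcher(None, key, " ".join(p)).ratio() > 0.9
               for p in kept):
            continue  # near duplicate of an already kept, more frequent phrase
        kept.append(phrase)
    return kept

titles = ["Software Engineer"] * 5 + ["Owner / Operator", "Softwares Engineer"]
vocab = {"software", "engineer", "owner", "operator", "softwares"}
print(build_phrase_set(titles, vocab, min_freq=1))  # [('engineer', 'software')]
```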

Now that we have a feasible number of phrases, we can Crowdsource Phrase Classification. Crowdsourcing sounds simple, but it may not actually be simple. A straightforward approach would be to submit the task to a crowdsourcing tool, asking the workers to assign each phrase to zero, one, or more classes from the predefined set C. (Although we do not preserve word ordering in the phrases, we need to present them to annotators in the form that is most commonly used in our data.) If priced properly, the task will be done quickly, but chances are strong that the results will be unsatisfactory (in some cases, practically random). The voting mechanism (i.e., let a number of workers annotate each phrase and then take the majority vote) does not seem to help much, as the mode sometimes emerges from a seemingly uniform distribution. The mechanism of qualification tests (which a worker needs to pass in order to take the task) is helpful, though. The best results are achieved when (a) trust is established between the requester and every worker, and (b) every worker is well educated about the task to be performed.

As a byproduct of the classification process, workers can detect nonsensical phrases, as well as undesired words in the controlled vocabulary. Their feedback can be plugged back into the system and the necessary changes can be made.

Even when the workers are well trained for the task, the results are likely to have two issues: systematic errors and consistency errors. A systematic error occurs when the majority of workers make the same incorrect decision. Systematic errors are hard to discover and fix. One way to reduce the systematic error rate is to choose workers of different backgrounds and geographic locations, and to use the voting mechanism. Fortunately, systematic errors are not very common.

Consistency errors are more common, but also easier to tackle. A consistency error occurs when one choice is made for an item, and a different choice for a similar item. Consistency errors occur even when the voting mechanism is used. When the consensus of voters is weak, it is often enough for a single worker to make an inconsistent choice in order to introduce a consistency error. To dramatically reduce the consistency error rate, it is desirable for each worker to perform the entire task. This increases the systematic error rate, but, again, those errors are quite rare. Even when each worker performs the entire task, consistency errors are common, primarily because people do not remember all the choices they have made. Another approach to reducing the consistency error rate is to present the workers with sets of similar items. Even then, the results are rather inconsistent.

This observation leads us to the next step in the pipeline: Check Classification Consistency. We represent each phrase t as a feature vector over the aggregated bag-of-words of all documents mapped onto t. This representation is supposed to be dense enough for machine learning to be applied. If it is still too sparse, we can make use of additional data, such as documents hyperlinked from our dataset D (e.g., shared links in tweets).

Once represented as a feature vector, each phrase passes a consistency check, which utilizes a simple k-nearest neighbors (kNN) model: each phrase t_i is assigned to a class c_i^knn based on the class majority vote of its neighbors in the vector space. Whenever c_i^knn ∈ C_i, where C_i is the set of classes obtained for t_i via the manual classification, we consider the classification to be consistent. A more interesting case is when c_i^knn ∉ C_i. Crowdsourcing the task of deciding between c_i^knn and C_i is unlikely to produce good results, as people are bad at choosing between two legitimate options. (It may not be unusual for as many as 200 workers asked to take a test of 20 such cases to fail to pass a 75% accuracy barrier.)

We use an SVM model as the judge. We train it on the consistently categorized phrases and apply it to the unresolved cases. Since the SVM model is fundamentally different from the kNN model (one is eager learning, the other is lazy learning), we believe that the two models are unlikely to produce a systematic error. If the SVM judgment c_i^svm falls within {c_i^knn} ∪ C_i, we go with the SVM result. If not, we declare the case too difficult and leave t_i uncategorized.
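The arbitration logic of the consistency check and the SVM judge reduces to a few lines. In the sketch below, the kNN and SVM predictions are assumed to be computed elsewhere, and only the decision rule from the text is shown.

```python
def resolve_label(manual_classes, knn_class, svm_class):
    """Arbitration rule from the text: accept the manual labels when kNN
    agrees; otherwise let the SVM judge; otherwise leave uncategorized."""
    if knn_class in manual_classes:
        return manual_classes                    # consistent: keep crowd labels
    if svm_class in manual_classes | {knn_class}:
        return {svm_class}                       # SVM resolves the conflict
    return None                                  # too difficult: uncategorized

print(resolve_label({"Engineering"}, "Engineering", None))   # {'Engineering'}
print(resolve_label({"Sales"}, "Marketing", "Marketing"))    # {'Marketing'}
print(resolve_label({"Sales"}, "Marketing", "Legal"))        # None
```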

With this, we approach the last step of the pipeline: Finalize Phrase Classification. Even after such an elaborate error detection process, some problems may remain in the phrase classification results. It turns out, however, that spot-checking existing classification results is a much easier task for a human than coming up with the original classification. The content manager can do a good job of fixing the last issues by simply scanning through the created phrase classification. Of course, not all issues can be detected this way. However, this process can go on long after the classification system is deployed. Given that the phrase classification task is by no means large-scale, it has the potential of eventually converging to 100% precision.

Online Process of Document Classification

The online part of the PBC system maps each document onto a subset of phrases from T; the document's classes are then obtained by a simple look-up of the classes of the corresponding phrases. Since the set of phrases T was created out of the most common phrases in the data D, many documents from D will be mapped onto at least one phrase. The mapping process is fully automatic, and fairly similar to the process of creating T. For each document d_i, we (a) extract the controlled vocabulary words V_i = V ∩ d_i; (b) given V_i, construct all possible phrases T_i ⊆ T; and (c) filter out irrelevant phrases, if any. The latter step may also incorporate some shallow NLP; for example, if the document is “Text Analytics and Image Classification”, then we must not map it onto the phrase text classification. Note that before we start the mapping process, we need to perform some preprocessing, such as taking care of word misspellings and abbreviations, which are to be mapped onto their full, correct forms.

Another important aspect of the document-to-phrases mapping is that we would prefer mapping documents onto the most specific phrases available (filtering in the offline process prevents us from getting too specific). For example, if a document is mapped onto three phrases: text, classification, and text classification, then we would choose the most specific phrase (text classification) and filter out the others. This is done because the more specific the phrase is, the more precise a decision the annotator can make.
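A minimal sketch of the online mapping, including the preference for the most specific phrases, is shown below; phrases are represented as unordered word tuples, and the translation table is an assumed example.

```python
def map_to_phrases(document, vocabulary, phrase_set, translations=None):
    """Online mapping: (a) keep controlled-vocabulary words (after
    translating misspellings/abbreviations), (b) find all known phrases
    contained in the document, (c) keep only the most specific ones."""
    translations = translations or {}
    words = {translations.get(w, w) for w in document.lower().split()}
    words &= vocabulary
    matched = [p for p in phrase_set if set(p) <= words]
    # Prefer the most specific phrases: drop any phrase that is a strict
    # subset of another matched phrase.
    return [p for p in matched
            if not any(set(p) < set(q) for q in matched)]

phrases = [("text",), ("classification",), ("classification", "text")]
V = {"text", "classification"}
print(map_to_phrases("Text Classification", V, phrases))
# [('classification', 'text')]
```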

Job Title Classification System

Our use case for testing the proposed PBC methodology is the classification of users' job titles at a social networking site. On their professional profiles, users of the social networking site may be free to specify their job titles in any form they wish. In various embodiments, we found almost 40,000 different ways in which users specified the title “Software Engineer”. FIG. 6 illustrates a few very common cases. (We did not include various subgroups of software engineers, nor software developers/programmers, in this count.)

An example task we describe here is to categorize the job titles into a few dozen classes according to job function. Examples of the classes can be Engineering, Marketing, Sales, Support, Healthcare, Legal, Education, etc. For the offline module (phrase classification), we may use a dataset of about 100M job titles. Obviously, many of the titles repeat in the data, especially generic titles such as “Manager”, “Director”, etc. After we lowercase all titles and identify duplicates, we create a dataset of unique titles, which may turn out to be about 5 times smaller than our original dataset. Let us first discuss the offline part of our PBC implementation.

Agree upon Taxonomy of Classes. The set of classes was provided by the business owner, and we accepted it.

Create Controlled Vocabulary. There are three important types of words in job titles: (a) job names, such as engineer, president, comedian, etc.; (b) seniority words, e.g., senior, junior, principal; and (c) function words, such as sales, research, financial, etc. Seniority words are the smallest bucket; function words are the largest bucket. Altogether, our controlled vocabulary consists of about 2500 words. Also, we create a translation look-up table, where abbreviations, common misspellings, and foreign language words are translated into controlled vocabulary words.

Note that classification of foreign titles is quite straightforward because our domain is very narrow: simple word-by-word translation usually creates a meaningful English phrase that annotators can successfully categorize. Two corner cases that we may need to take care of are the translation of one word into multiple words (typical in German), and the translation of multiple words into one word (typical in French).

Build 606 Phrases. Titles are cleaned by filtering out words that do not belong to our controlled vocabulary and applying translations. Following the recipe described above, we split compound titles (such as “Owner/Operator”), get rid of too specific titles, and deduplicate the resulting list. We then select a few tens of thousands of the most common cleaned titles to be our set of phrases (we call them standardized titles). Examples of standardized titles are software engineer, senior software engineer, software developer, java programmer, etc. We verify that the list of standardized titles is comprehensive enough to cover the spectrum from very common to relatively rare titles.

Classify 608 phrases. Classification of the standardized titles is another step in the project. It may be important for us to perform the job with the highest possible precision, so we may choose not to outsource it. Instead, we may choose the annotators to be employees of the social networking site. To reduce possible systematic errors, we may need to hire annotators of diversified backgrounds. At the same time, we may need to minimize the effort of educating and training our annotators, so we may organize them in groups. We create two groups of annotators in two separate geographic regions, and ask each annotator to label the entire set of standardized titles. After the task is completed, we apply the voting mechanism. A certain percentage (e.g., 15%) of the standardized titles may not gain a majority vote (e.g., no class may be chosen for those titles by the majority of annotators). We then perform two rounds of disambiguation to reach a consensus classification of every standardized title in the list.
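A sketch of the voting step follows; treating titles that fail to gain a strict majority as candidates for the disambiguation rounds is our reading of the text, and the data structures are assumed.

```python
from collections import Counter

def majority_vote(annotations, threshold=0.5):
    """Pick the class chosen by a strict majority of annotators; titles
    with no majority are flagged for a disambiguation round."""
    decided, undecided = {}, []
    for title, votes in annotations.items():
        winner, count = Counter(votes).most_common(1)[0]
        if count / len(votes) > threshold:
            decided[title] = winner
        else:
            undecided.append(title)
    return decided, undecided

votes = {"software engineer": ["Engineering"] * 4,
         "account manager": ["Sales", "Sales", "Finance", "Support"]}
print(majority_vote(votes))
# ({'software engineer': 'Engineering'}, ['account manager'])
```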

Check 610 Classification Consistency. As described above, we apply the kNN model. We may find that a certain percentage (e.g., 25%) of the standardized title classifications are not consistent with the kNN model results. We may then apply the SVM model as a judge and resolve most of the inconsistent cases. We may leave uncategorized those cases in which the SVM produces a result that differs from both the kNN result and the original manual classification result.

Finalize 612 Phrase Classification. The process of finalizing the standardized title classification may be important for the success of the project. Spot-checking may reveal a number of systematic errors and consistency errors. After the classification is finalized, the new challenge is to evaluate the standardized title classification results. We may not want to use our annotators to evaluate their own work. Therefore, we may hire a different group of annotators to evaluate the classification results. This different group of annotators may evaluate a certain percentage (e.g., 25%) of the standardized title classifications and estimate the classification precision (e.g., 98%).

The online process of mapping original titles onto the standardized titles is fairly simple. For each original title, we first apply word translations and filter out words that do not belong to our controlled vocabulary. Out of the remaining words, we compose all possible standardized titles from our list. We then apply three NLP heuristics to get rid of problematic mappings. For example, we avoid mapping a compound title “Senior Engineer/Project Manager” onto the standardized title senior project manager. We also remove generic mappings if more specific ones are available. For example, if a title is mapped onto both senior software engineer and software engineer, we remove the latter. The resulting mapping may achieve a certain coverage (e.g., more than 95% coverage) on our dataset (e.g., of about 100M titles).

A challenging task may be to evaluate the precision of the mapping. We may use one of the commercial crowdsourcing providers for this task. The provider may evaluate a large set of mappings, where the evaluation set of original titles is sampled from the original title distribution (such that more common titles have a higher probability of being selected). In embodiments in which there is a high level of duplication in our data, the evaluation set may cover a subset (e.g., about 40%) of all titles in our dataset. The evaluation may estimate the mapping precision (e.g., at 97%).

Given the precision of standardized title classification (e.g., 98%), the lower bound on the precision of the overall (original job title) classification process may be determined by multiplying the classification precision by the mapping precision (e.g., 97% × 98% ≈ 95%). It is a lower bound because an original title can be incorrectly mapped onto a standardized title that happens to belong to the same class as the original title. The overall precision may then be analyzed to determine whether the classification process achieves our precision goals. The coverage may also be analyzed. In various embodiments, particular coverage goals may not be met. For example, many standardized titles (usually generic ones, such as “Manager”) may not have an appropriate class in C. Obviously, coverage depends not only on the system but also on the data and the taxonomy of classes, which makes coverage goals rather vulnerable. Higher coverage may be achieved either by introducing new data classes, or by mapping the input data onto more specific phrases: those that can be categorized.

Comparison with SVM

A legitimate question is how well traditional text classification tools would cope with the problem of job title classification. In various embodiments, after applying PBC classification, we may have on the order of a hundred million job titles categorized with 95% precision. We can use any portion of this data to train a classifier and compare its results with those we obtained with PBC.

We use the SVM classifier for this comparison. Note that the SVM is not a strawman: it is one of the best classification models available, particularly well suited for text. We set up a multilabel SVM framework. For example, we use SVMlight with parameters c=0.1 and j=2, which are chosen based on our prior knowledge.

We train an SVM on a portion of the job title data, then test it on the entire data and compare the classes assigned by the SVM with the classes assigned by PBC. If the SVM is within the 90th percentile of PBC, we can say that the document-based classification results are comparable with the phrase-based classification results. However, if the SVM results are farther from those of PBC, we can claim that the SVM does a poor job (given that PBC is 95% accurate).

We propose two quality measures of the classification results: a tolerant one and an aggressive one:

Partial Match. For each data instance d_i, we choose the single class c_i^svm that has the maximal confidence score among those obtained by the one-against-all binary classifiers. Note that the maximal score can actually be negative. Then we consider d_i to be successfully categorized if c_i^svm ∈ C_i (i.e., if c_i^svm is among the classes assigned by PBC). In this setup, we apply k+1 binary classifiers, where the extra classifier is for “No Class”: we compose an artificial class of instances that PBC did not assign to any class.

Full Match. For each data instance d_i, we build a set C_i^svm of classes that have positive confidence scores. We then consider d_i to be successfully categorized if C_i^svm = C_i (i.e., the SVM classes and the PBC classes are identical). In this setup, we do not need an extra class for “No Class”.
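The two measures might be implemented as in the sketch below, with SVM confidence scores represented as a per-class dictionary; this representation is our assumption.

```python
def partial_match(svm_scores, pbc_classes):
    """Tolerant measure: the single top-scoring SVM class (which may have a
    negative score) must be among the PBC classes; 'No Class' covers
    instances that PBC left unassigned."""
    top = max(svm_scores, key=svm_scores.get)
    return top in (pbc_classes or {"No Class"})

def full_match(svm_scores, pbc_classes):
    """Aggressive measure: the set of classes with positive SVM scores must
    equal the PBC class set exactly."""
    positive = {c for c, s in svm_scores.items() if s > 0}
    return positive == pbc_classes

scores = {"Engineering": 1.2, "Sales": 0.3, "Legal": -0.7}
print(partial_match(scores, {"Engineering"}))          # True
print(full_match(scores, {"Engineering"}))             # False: Sales also positive
print(full_match(scores, {"Engineering", "Sales"}))    # True
```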

We use the standard Precision, Recall, and F1 measures in both setups, as obtained on the entire data D, excluding instances that were not covered by PBC. Note that this way the SVM is evaluated on both the training and testing data simultaneously, and this is a rare case in which that is not a problem: since we start from scratch (no labeled data is initially provided), it does not matter which instances are categorized manually and which automatically, as long as the results can be fairly compared. We achieve a fair comparison by training the SVM on the same number of instances as were manually labeled in the PBC process.

We tested four options for applying SVMs: (a) train on the set of standardized titles T, and test on D while taking into account only features from T, that is, only our controlled vocabulary V; (b) the same, but apply the word translation lookup table for mapping some extra features of the data D onto V; (c) train on the most common titles in D; (d) train on titles sampled from the title distribution over D. The reason for choosing those options is that, in traditional text classification, option (d) would be the default one, option (c) could be plausible, but options (a) and (b) would not be available. Our goal is to see whether the domain knowledge provided in V and the translation table can improve the SVM classification results.

FIG. 7 illustrates an example summary 800 of findings. The Partial Match results may be very good, which proves again that the SVM classifier is strong. The Full Match results may be much poorer. They suggest that SVMs are not able to identify all the classes an instance can belong to, while phrase-based classification copes with this problem very gracefully. Among the four training options, option (a) turned out to be the worst one, which can be easily explained by the fact that T (without word translations) had the fewest features among the four setups. However, when word translations were added (option (b)), this setup became the leader, which suggests that the domain knowledge is helpful for SVM classification. Options (c) and (d) performed remarkably similarly, which was rather predictable, given that sampling from the title distribution (option (d)) heavily prefers the most common titles (option (c)).

Thus, we show that phrase-based classification (PBC) can achieve extremely high precision (e.g., 95%) with reasonable coverage (e.g., 80%) on a large-scale text collection (e.g., of over 100M instances). We characterize the class of data on which PBC can be successful (the data that satisfies the near-sufficiency property), and prove that PBC will be feasible on data of virtually any size. The natural language may prevent the annotation task from being overwhelmingly large. The development cost of our deployed PBC system may be low (e.g., 2 person-years plus annotation expenses), as may be the maintenance cost (which boils down to periodically updating the controlled dictionary as well as the pool of categorized phrases). In various embodiments, within half a year of no maintenance, the coverage of our job title classification system may decrease by only a certain percentage (e.g., by 0.3%), which shows its sustainability. Overall, this is a classification framework that can be directly applied to other tasks in which the near-sufficiency property holds.

An advantage of the deployed system may be in the right use of human labor: the domain expertise of the content manager, combined with the common knowledge of the crowd, is leveraged in intuitive, single-action tasks such as “choose a word”, “categorize a phrase”, and “check a phrase/category pair”.

We note that crowdsourcing is not a panacea, and is actually a much more complex undertaking than one might expect. Without a consistency check, the crowdsourcing classification results may be rather poor. Consider our job title classification use case, for which we hired two groups of annotators. Since we now have all the final results, we can see how well each group did as compared to the final results. In various embodiments, it turns out that one group achieved a 73.2% F-measure in the Full Match setup (described above), while the other group achieved a 76.7% F-measure. Together they achieved a 79.3% F-measure. This implies that the two-step consistency check contributed over 20% to the F-measure, bringing success to the entire framework.

FIG. 8 is a block diagram of a machine in the example form of a computer system 900 within which instructions for causing the machine to perform any one or more of the methodologies discussed herein may be executed. In alternative embodiments, the machine operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine may operate in the capacity of a server or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine may be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

The example computer system 900 includes a processor 902 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 904, and a static memory 906, which communicate with each other via a bus 908. The computer system 900 may further include a video display unit 910 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)). The computer system 900 also includes an alphanumeric input device 912 (e.g., a keyboard), a user interface (UI) navigation (or cursor control) device 914 (e.g., a mouse), a storage unit 916, a signal generation device 918 (e.g., a speaker), and a network interface device 920.

The storage unit 916 includes a machine-readable medium 922 on which is stored one or more sets of data structures and instructions 924 (e.g., software) embodying or utilized by any one or more of the methodologies or functions described herein. The instructions 924 may also reside, completely or at least partially, within the main memory 904 and/or within the processor 902 during execution thereof by the computer system 900, the main memory 904 and the processor 902 also constituting machine-readable media. The instructions 924 may also reside, completely or at least partially, within the static memory 906.

While the machine-readable medium 922 is shown in an example embodiment to be a single medium, the term “machine-readable medium” may include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more instructions or data structures. The term “machine-readable medium” shall also be taken to include any tangible medium that is capable of storing, encoding or carrying instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present embodiments, or that is capable of storing, encoding or carrying data structures utilized by or associated with such instructions. The term “machine-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media. Specific examples of machine-readable media include non-volatile memory, including by way of example semiconductor memory devices, e.g., Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and compact disc-read-only memory (CD-ROM) and digital versatile disc (or digital video disc) read-only memory (DVD-ROM) disks.

The instructions 924 may further be transmitted or received over a communications network 926 using a transmission medium. The network 926 may be any of the networks described herein (e.g., the network depicted in FIG. 1). The instructions 924 may be transmitted using the network interface device 920 and any one of a number of well-known transfer protocols (e.g., Hyper Text Transfer Protocol (HTTP)). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), the Internet, mobile telephone networks, Plain Old Telephone (POTS) networks, and wireless data networks (e.g., WiFi and WiMax networks). The term “transmission medium” shall be taken to include any intangible medium that is capable of storing, encoding or carrying instructions for execution by the machine, and includes digital or analog communications signals or other intangible media to facilitate communication of such software.

Although an embodiment has been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the invention. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense. The accompanying drawings that form a part hereof show, by way of illustration, and not of limitation, specific embodiments in which the subject matter may be practiced. The embodiments illustrated are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed herein. Other embodiments may be utilized and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. This Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.

Such embodiments of the inventive subject matter may be referred to herein, individually and/or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed. Thus, although specific embodiments have been illustrated and described herein, it should be appreciated that any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.

1. (canceled)
2. A system comprising: one or more processors; one or more modules implemented by the one or more processors, the one or more modules configured to, at least: construct a controlled vocabulary of words relevant for categorizing a set of data items into a set of classes; extract sets of phrases from the set of data items that are composed of words in the controlled vocabulary; classify different combinations of the sets of phrases into different classes of the set of classes; map each data item of the set of data items onto the set of classes based on a correspondence between sets of phrases in the data item and the different combinations of the sets of phrases extracted from the set of data items; and customize a service of a back-end system based on the mapping of each data item onto the set of classes.
3. The system of claim 2, wherein the extracting of the sets of phrases from the controlled vocabulary includes selecting a predetermined percentage of the most common data items included in the set of data items.
4. The system of claim 2, wherein the classifying of the different combinations of the sets of phrases into different classes of the set of classes includes receiving input from a crowdsourcing system.
5. The system of claim 2, wherein the mapping of each data item of the set of data items onto the set of classes includes removing words from each data item that are not included in the controlled vocabulary.
6. The system of claim 2, wherein the different combinations of the sets of phrases include all possible phrases used in the different classes.
7. The system of claim 2, wherein the set of data items is a set of user-specified job titles and the set of classes is a set of standardized job titles.
8. The system of claim 7, wherein the service is a targeted advertising service and wherein the customizing of the service includes generating targeted advertising for a user based on the mapping.
9. A method comprising: constructing a controlled vocabulary of words relevant for categorizing a set of data items into a set of classes; extracting sets of phrases from the set of data items that are composed of words in the controlled vocabulary; classifying different combinations of the sets of phrases into different classes of the set of classes; mapping each data item of the set of data items onto the set of classes based on a correspondence between sets of phrases in the data item and the different combinations of the sets of phrases extracted from the set of data items; and customizing a service of a back-end system based on the mapping of each data item onto the set of classes.
10. The method of claim 9, wherein the extracting of the sets of phrases from the controlled vocabulary includes selecting a predetermined percentage of the most common data items included in the set of data items.
11. The method of claim 9, wherein the classifying of the different combinations of the sets of phrases into different classes of the set of classes includes receiving input from a crowdsourcing system.
12. The method of claim 9, wherein the mapping of each data item of the set of data items onto the set of classes includes removing words from each data item that are not included in the controlled vocabulary.
13. The method of claim 9, wherein the different combinations of the sets of phrases include all possible phrases used in the different classes.
14. The method of claim 9, wherein the set of data items is a set of user-specified job titles and the set of classes is a set of standardized job titles.
15. The method of claim 14, wherein the service is a targeted advertising service and wherein the customizing of the service includes generating targeted advertising for a user based on the mapping.
16. A non-transitory machine-readable medium embodying a set of instructions that, when executed by a processor, cause the processor to perform operations, the operations comprising: constructing a controlled vocabulary of words relevant for categorizing a set of data items into a set of classes; extracting sets of phrases from the set of data items that are composed of words in the controlled vocabulary; classifying different combinations of the sets of phrases into different classes of the set of classes; mapping each data item of the set of data items onto the set of classes based on a correspondence between sets of phrases in the data item and the different combinations of the sets of phrases extracted from the set of data items; and customizing a service of a back-end system based on the mapping of each data item onto the set of classes.
17. The non-transitory machine-readable medium of claim 16, wherein the extracting of the sets of phrases from the controlled vocabulary includes selecting a predetermined percentage of the most common data items included in the set of data items.
18. The non-transitory machine-readable medium of claim 16, wherein the classifying of the different combinations of the sets of phrases into different classes of the set of classes includes receiving input from a crowdsourcing system.
19. The non-transitory machine-readable medium of claim 16, wherein the mapping of each data item of the set of data items onto the set of classes includes removing words from each data item that are not included in the controlled vocabulary.
20. The non-transitory machine-readable medium of claim 16, wherein the different combinations of the sets of phrases include all possible phrases used in the different classes.
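For illustration only, the operations recited in independent claims 2, 9, and 16 (constructing a controlled vocabulary, extracting phrases, and mapping data items onto classes) might be sketched as follows; the function names, the top_fraction parameter, and the data layout are hypothetical, and the sketch is not a definitive implementation of the claimed subject matter.

```python
from collections import Counter
from typing import Optional

def construct_vocabulary(data_items: list[str], top_fraction: float = 0.2) -> set[str]:
    """Keep the most frequent words across the data items as the controlled
    vocabulary; top_fraction is a hypothetical tuning parameter."""
    counts = Counter(word for item in data_items for word in item.lower().split())
    keep = max(1, int(len(counts) * top_fraction))
    return {word for word, _ in counts.most_common(keep)}

def extract_phrase(item: str, vocabulary: set[str]) -> tuple[str, ...]:
    """Remove the words not in the controlled vocabulary; the remaining
    words form the phrase representing the data item."""
    return tuple(word for word in item.lower().split() if word in vocabulary)

def map_items(data_items: list[str],
              phrase_classes: dict[tuple[str, ...], str]) -> dict[str, Optional[str]]:
    """Map each data item onto a class via its extracted phrase;
    phrase_classes stands in for the (e.g., crowdsourced) categorized pool."""
    vocabulary = construct_vocabulary(data_items)
    return {item: phrase_classes.get(extract_phrase(item, vocabulary))
            for item in data_items}
```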